SUMMARY

Five benchmark programs were obtained and run on the NASA Lewis CRAY
X-MP/24. A comparison was made between the program codes and between the
methods for calculating performance figures. Several multitasking jobs were
run to gain experience in how parallel performance is measured.


INTRODUCTION 


During the past 5 years, there has been an increased interest in
benchmarking supercomputer performance. New benchmarks have been written while
older benchmarks have been put in modern perspective. Even the National
Bureau of Standards has begun collecting Parallel Computer Benchmark Programs
as part of the effort of its Computer Measurement Research Facility (CMRF)
project. This collection, maintained by the Institute for Computer Sciences
and Technology at NBS, is open to supercomputer users so that they may borrow
from it and contribute to it.


Reports and articles written on a particular benchmark usually indicate
[...] and compilers. Some list quite a comprehensive range of machines,
including machines of the same type (such as a CRAY X-MP/22) but with
different operating systems and compilers.


Performance figures for the same benchmark program run on the same machine
at two or more locations can vary because the program is run under different
operating conditions. An example is given in reference 3. We thought it would
be interesting to collect a set of benchmarks and run them on the same machine
(our X-MP/24) to gain some appreciation for and understanding of why a
machine's performance figures can vary (sometimes greatly) depending on the
benchmark program.


*Summer Faculty Fellow* 


This report summarizes the results of our effort to: 

• Collect a set of different benchmark programs and run them on our
CRAY X-MP/24 to gain experience in how performance data is collected and how
it can vary between benchmarks and between runs of the same benchmark.

• Set up a means for running these programs, so that when changes are
made to the operating system or hardware, the programs can be rerun to see
what the effects are on the performance data.

• Perform some initial experiments with multitasking to determine what
kind of performance measurement to look for.

This benchmark collection contains in part specific routines which are
used in scientific/engineering computing and in part segments of code which
are a generic mix of calculations and instructions typical of
scientific/engineering computing. However, it does not represent a model of
the specific workload at NASA Lewis.


DESCRIPTION OF THE BENCHMARKS 

Five benchmark programs were obtained. 

The NAS Kernel Benchmark Program (ref. 1)   Authors: Dave Bailey and John Barton

The Argonne Programs (ref. 5)   Author: Jack Dongarra

The Sandia Benchmark (SPEED) (ref. 2)   Authors: T.H. Jefferson and M.R. Scott

The Whetstone Benchmark (ref. 7)   Authors: H.J. Curnow and B.A. Wichmann

The Livermore Loops   Author: F.H. McMahon

We include here a brief description of each one.


The NAS Kernel Benchmark 

This is one of the more recently written benchmarks. It was developed for
use by the NAS (Numerical Aerodynamic Simulation) Projects Office at NASA Ames
Research Center. It consists of approximately 1000 lines of FORTRAN code
organized into seven tests, which are referred to as kernels. The calculations
performed typify the type of supercomputing done at Ames. Since it is a more
recent benchmark, the seven kernels emphasize the vector performance of a
computer system. The seven kernels are:

(1) MXM - performs a matrix product on two input matrices employing a
    four-way unrolled outer product algorithm

(2) CFFT2D - performs a complex radix 2 Fast Fourier Transform on a
    two-dimensional input array

(3) CHOLSKY - performs a Cholesky decomposition on a set of input matrices

(4) BTRIX - performs a block tridiagonal matrix solution along one
    dimension of a four-dimensional array

(5) GMTRY - sets up an array for a vortex method solution and performs
    Gaussian elimination on the resulting array

(6) EMIT - creates new vortices according to certain boundary conditions

(7) VPENTA - simultaneously inverts three matrix pentadiagonals in a
    manner conducive to vector processing

For a more detailed discussion, see references 1 and 4.

The Argonne Programs 

LINPACK is a library of FORTRAN linear algebra subroutines co-authored by
Jack Dongarra in 1979. Over the past several years he has been publishing
results of a benchmark program which solves systems of linear equations of
order 100 using routines from the LINPACK collection. The latest results are
given in Performance of Various Computers Using Standard Linear Equations
Software in a Fortran Environment (ref. 5).

We obtained from Argonne National Laboratory a tape consisting of nine
files, each of which is a self-contained benchmark. The following three were
selected to be included in our benchmark study.

(1) A LINPACK system solver - A program and subroutines to measure timing
of the LINPACK routines for solving a dense system of equations.

(2) A better LU decomposition - A program consisting of a different
implementation of the solution of linear equations (ref. 9) using an algorithm
based on matrix-vector operations rather than just vector operations. As
reported in reference 5, it better reflects the true performance of a
supercomputer than the LINPACK routines.

(3) A Vector Loop program - A program that indicates how well a compiler
vectorizes some standard loops. No timing results are included.

Two of the other files were double precision versions of (1) and (2)
above. Two others (one single precision and one double precision) contained
only definitions and declarations that allow users to insert their own LU
decomposition. The remaining one was a program to study indirect addressing
in single precision.


The Sandia Benchmark (SPEED)

This is a program written at Sandia National Laboratory which consists of
five kernels taken from programs in use at Sandia in 1978. The five kernels
are:

(1) A linear equation solver with pivoting

(2) A routine which consists of part of the predictor step of an ordinary
    differential equation solver

(3) A routine consisting of a forward and backward substitution excerpt
    from a linear equation solver with pivoting

(4) A routine which consists of an excerpt from a Vortex Dynamics Code

(5) A routine which consists of an excerpt from a lattice relaxation code

Although the first three were taken from software library routines, the last
two were taken from large user codes at Sandia. Thus this benchmark typifies
the workload there.


The Whetstone Benchmark 

This is a synthetic benchmark developed in the early 1970s by Curnow and
Wichmann at the U.K.'s National Physical Laboratory in Whetstone, England. It
differs from the three benchmarks above in that it was developed to
match instruction frequency statistics of language usage (originally ALGOL)
collected from programs run at that laboratory. The resulting program is said
to represent the execution of one million Whetstone instructions. The inverse
of the measured run time (in seconds) therefore indicates millions of
Whetstone instructions per second. Our copy is a FORTRAN version which was
obtained from the Computer Measurement Research Facility at the National
Bureau of Standards.


The Livermore Loops 

This program was developed at the Lawrence Livermore National Laboratory,
Livermore, CA, by F.H. McMahon, and had its beginnings in the late 1960s and
early 1970s. Work was sponsored by the DOE. The version we obtained consists
of 24 kernels (or loops), each consisting of a relatively small extract from
a CPU-limited scientific application program. These computational structures
are considered to be the most important CPU time components from the
applications. They are:


Kernel   Description

   1     Hydro Fragment
   2     Incomplete Cholesky Conjugate Gradient Excerpt
   3     Inner Product
   4     Banded Linear Equations
   5     Tri-diagonal Elimination - Below Diagonal
   6     General Linear Recurrence Relation
   7     Equation of State Fragment
   8     ADI Integration
   9     Integer Predictors
  10     Difference Predictors
  11     First Sum
  12     First Difference
  13     2-Dimensional Particle in Cell
  14     1-Dimensional Particle in Cell
  15     Casual FORTRAN, Development Version
  16     Monte Carlo Search Loop
  17     Implicit, Conditional Computation
  18     2-Dimensional Explicit Hydro Fragment
  19     General Linear Recurrence Equations
  20     Discrete Ordinate Transport
  21     Matrix * Matrix Product
  22     Planckian Distribution
  23     2-Dimensional Implicit Hydro Fragment
  24     Location of 1st Minimum in Array


Observations


The NAS kernels, the Argonne programs, the Sandia benchmark, and the
Livermore Loops produce timing information in MFLOPS (millions of floating
point operations per second) by dividing the number of floating point
operations by the CPU time. However, the methods for counting the number of
floating point operations differ.

In the NAS Kernel Benchmark, the number of floating point operations for
each kernel is computed as follows. Each function operation (+, -, /, SQRT,
SIN, etc.) has a precise number of floating point operations (weight)
associated with it. For example, the addition of two reals counts as one
floating point operation, while the division of a real by a real counts as
three. A complete table showing the number of floating point operations for
the various functions is given in reference 1. The total number of floating
point operations for each kernel is obtained by summing the products of the
number of occurrences of each function operation and its weight.

Only the first two Argonne programs listed above give an MFLOP rate. In
the LINPACK system solver, the number of operations is computed using a
well-known formula which approximates the number of additions and
multiplications for solving a system of n equations in n unknowns. The formula
is a function of the order of the matrix. The third program chosen gives no
timing information since it only tests vectorization capability.
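The report does not state the operation-count formula itself. As a rough illustration only, the sketch below assumes the commonly quoted LINPACK approximation of 2n^3/3 + 2n^2 operations for a dense system of order n; the function names and the sample CPU time are hypothetical.

```python
def linpack_flops(n):
    """Approximate add/multiply count for solving a dense n-by-n system.

    Assumption: the commonly quoted LINPACK count 2n^3/3 + 2n^2
    (factorization plus forward and back substitution).
    """
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def mflops(n, cpu_seconds):
    """Rate in millions of floating point operations per second."""
    return linpack_flops(n) / cpu_seconds / 1.0e6


if __name__ == "__main__":
    n = 100
    print(f"operations for n={n}: {linpack_flops(n):.0f}")
    # Hypothetical CPU time, chosen only to show the arithmetic.
    print(f"rate at 0.03 s: {mflops(n, 0.03):.1f} MFLOPS")
```

Because the count depends only on n, the reported rate for a given order is determined entirely by the measured CPU time.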

The Sandia Benchmark defines a floating point operation as an add,
subtract, multiply, or divide, with each operation counting equally.

In the Livermore Loops program, floating point operations are counted
according to the following weights:


Operation          Weight

+, *                  1
/, SQRT               4
EXP, SIN, etc.        8
IF (X rel. Y)         1


The sum of the products of the number of occurrences of each of these
operations and its weight gives the floating point operation count. This
weight association is different from the one used in the NAS benchmark.
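To make the effect of the different counting conventions concrete, the following sketch applies the three sets of weights to one operation mix. The operation counts and CPU time are hypothetical. The Sandia and Livermore weights follow the text; for the NAS convention only the add and divide weights are quoted above, so the multiply weight of 1 is an assumption.

```python
# Weight per operation under each counting convention.
# NAS: add = 1 and divide = 3 are quoted in the text; multiply = 1 is assumed.
WEIGHTS = {
    "sandia":    {"add": 1, "mul": 1, "div": 1},
    "livermore": {"add": 1, "mul": 1, "div": 4},
    "nas":       {"add": 1, "mul": 1, "div": 3},
}

# Hypothetical operation mix and CPU time for a single kernel.
OP_COUNTS = {"add": 1_000_000, "mul": 1_000_000, "div": 100_000}
CPU_SECONDS = 0.05


def weighted_flops(op_counts, weights):
    """Sum of (number of occurrences x weight) over all operation types."""
    return sum(count * weights[op] for op, count in op_counts.items())


def mflops_by_convention(op_counts, cpu_seconds):
    """MFLOPS figure the same run would receive under each convention."""
    return {
        name: weighted_flops(op_counts, w) / cpu_seconds / 1.0e6
        for name, w in WEIGHTS.items()
    }


if __name__ == "__main__":
    for name, rate in mflops_by_convention(OP_COUNTS, CPU_SECONDS).items():
        print(f"{name:10s} {rate:6.1f} MFLOPS")
```

The same code running for the same time receives three different MFLOPS figures, which is why rates from different benchmarks cannot be compared directly.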

The Livermore Loops program produces the most comprehensive set of
statistics. Also unique to this benchmark is a harmonic mean of the kernel
rates. It can be argued that the harmonic mean (or, more generally, the
weighted harmonic mean) is more meaningful than the arithmetic mean (refs. 3
and 8).


BENCHMARK RESULTS 

The five benchmark programs provided the following results on our CRAY
X-MP/24 running under COS 1.14 BF4. Each program was executed in dedicated
time, meaning that this was the only job executing in the system at the time.
All other jobs, including diagnostics, were suspended. These runs were made
without any changes to the program codes we received (Level 0 tests as defined
in ref. 1).

THE NAS KERNEL BENCHMARK


Program     MFLOPS

MXM           136
CFFT2D         51
CHOLSKY        53
BTRIX          80
GMTRY          70
EMIT           82
VPENTA         41

Total          65


The total MFLOPS represents the ratio of the sum of the floating point 
operations (FP OPS) for each kernel to the sum of the times for each kernel. 
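A ratio of sums of this kind is the same thing as a harmonic mean of the kernel rates weighted by each kernel's operation count. The sketch below checks that identity using the kernel rates from the table; the per-kernel operation counts are hypothetical (set equal here), since the actual counts are not listed in this report.

```python
# Kernel rates (MFLOPS) from the NAS table above.
NAS_RATES = [136, 51, 53, 80, 70, 82, 41]

# Hypothetical operation counts, in millions, one per kernel.
MFLOP_COUNTS = [1000.0] * len(NAS_RATES)


def total_rate(mflop_counts, seconds):
    """Overall rate: total operations divided by total time."""
    return sum(mflop_counts) / sum(seconds)


def weighted_harmonic_mean(rates, weights):
    """Harmonic mean of rates, weighted by the work done at each rate."""
    return sum(weights) / sum(w / r for w, r in zip(weights, rates))


# Time each kernel would take at its measured rate for the assumed work.
KERNEL_SECONDS = [c / r for c, r in zip(MFLOP_COUNTS, NAS_RATES)]

if __name__ == "__main__":
    print(f"ratio of sums:          {total_rate(MFLOP_COUNTS, KERNEL_SECONDS):.1f}")
    print(f"weighted harmonic mean: {weighted_harmonic_mean(NAS_RATES, MFLOP_COUNTS):.1f}")
    print(f"arithmetic mean:        {sum(NAS_RATES) / len(NAS_RATES):.1f}")
```

With unequal operation counts the two figures still agree with each other, but the result shifts toward the rates of the kernels that do the most work.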

THE ARGONNE PROGRAMS 

Program 1. LINPACK System Solver

MFLOPS: 22 (system of order 100)

The routines SGEFA and SGESL from LINPACK perform standard LU decomposition
with partial pivoting and back substitution. This is a FORTRAN version with
simple statements and simple loops.

Program 2. A Better LU Decomposition 

Array dimensions 301

MFLOPS by matrix order and unrolled depth:

Unrolled                        Order
 Depth       50     100     150     200     250     300

   1         17      32      44      53      61      68
   2         22      42      58      70              88
   4         25      50      68      81      91      99
   8         27      57      78      94             115
  16         26      57      79      96     108     117


The algorithm for this LU decomposition, whose description can be found in
reference 9, is based on matrix-vector operations rather than just vector
operations. Notice that timings are given for matrices of six different orders
ranging from 50 to 300.

Program 3. Vector Loop Program

The following table gives annotations of the types of loops (17 in all) and
whether or not they vectorized.


Loop  Description                                              Vectorized

  1   Statements in wrong order                                    n
  2   Dependency needing a temporary                               y
  3   Loop with unnecessary scalar store                           y
  4   Loop with ambiguous scalar temporary                         n
  5   Loop with subscript that may seem ambiguous                  y
  6   Recursive loop that really isn't                             y
  7   Loop with possible ambiguity because of scalar store         n
  8   Loop that is partially recursive                             n
  9   Loop with unnecessary array store                            y
 10   Loop with independent conditional                            n
 11   Loop with noninteger addressing                              y
 12   Simple loop with dependent conditional                       y
 13   Complex loop with dependent conditional                      y
 14   Loop with singularity handling                               y
 15   Loop with simple gather/scatter subscripting                 n
 16   Loop with multiple dimension recursion                       n
 17   Loop with multiple dimension ambiguous subscripts            y


THE SANDIA BENCHMARK (SPEED) 


Kernel     MFLOPS

  1          23
  2          11
  3          39
  4          10
  5           8

Total        13

In reference 2, a modified version of this program is reported to give a
total of 36.8 MFLOPS, and an unmodified version is reported to show a total of
10 MFLOPS.


THE WHETSTONE BENCHMARK 

The total CPU time for executing the program, found by taking the difference
between two calls to the function SECOND, was


Time (T)         1/T

0.04030 s     25 MWIPS

MWIPS = millions of Whetstone instructions per second.
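The arithmetic behind this figure can be sketched as follows. The helper names are hypothetical, and `time.process_time` stands in here for the CRAY function SECOND as an assumed analogue (both return CPU time, so the elapsed CPU time is the difference of two calls).

```python
import time


def mwips(run_seconds, million_instructions=1.0):
    """Millions of Whetstone instructions per second for one timed run."""
    return million_instructions / run_seconds


def cpu_seconds(workload):
    """CPU time used by a callable, from the difference of two clock reads."""
    start = time.process_time()
    workload()
    return time.process_time() - start


if __name__ == "__main__":
    # Run time reported above for the FORTRAN version on the X-MP/24.
    print(f"{mwips(0.04030):.1f} MWIPS")
```

The reported 25 MWIPS is this quotient rounded to the nearest integer.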

THE LIVERMORE LOOPS 


Mean Vector Length = 468 


Kernel     MFLOPS     Span

   1         152      1001
   2          26       101
   3         137      1001
   4          44      1001
   5           6      1001
   6          13        64
   7         171       995
   8         113       100
   9         145       101
  10          65       101
  11           8      1001
  12          71      1000
  13           4        64
  14          11      1001
  15           5       101
  16           3        75
  17           9       101
  18         112       100
  19           7       101
  20          12      1000
  21          29        25
  22          66       101
  23          13       100
  24           2      1001


MFLOPS Range  = 2 to 171
Harmonic Mean = 10
Median Rate   = 26
Median Dev.   = 40
Average Rate  = 51
Standard Dev. = 55

The interesting and probably more meaningful indicator of overall
performance is the harmonic mean computed with equal weights attached, i.e.,

$$ H = \frac{24}{\sum_{i=1}^{24} \frac{1}{R_i}} $$

where $R_i$ = the MFLOPS of kernel i, i = 1, ..., 24. Each kernel above was
assigned a weight of 1. As explained in references 3 and 8, the harmonic mean
is a more meaningful representation of the actual performance and measure of
the workload. This can be particularly true if there is a significant
difference between the best and worst rate among all kernels or measured
performances in a benchmark.
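As a check on the statistics above, the equal-weight harmonic mean and the arithmetic mean can be recomputed from the 24 kernel rates in the table. This is a minimal sketch; because the tabulated rates are rounded to whole MFLOPS, the results agree with the reported figures only to that precision.

```python
# MFLOPS for kernels 1 through 24, from the Livermore Loops table above.
LIVERMORE_RATES = [
    152, 26, 137, 44, 6, 13, 171, 113, 145, 65, 8, 71,
    4, 11, 5, 3, 9, 112, 7, 12, 29, 66, 13, 2,
]


def harmonic_mean(rates):
    """Equal-weight harmonic mean: n divided by the sum of reciprocals."""
    return len(rates) / sum(1.0 / r for r in rates)


def arithmetic_mean(rates):
    """Ordinary average of the rates."""
    return sum(rates) / len(rates)


if __name__ == "__main__":
    print(f"harmonic mean:   {harmonic_mean(LIVERMORE_RATES):.1f} MFLOPS")
    print(f"arithmetic mean: {arithmetic_mean(LIVERMORE_RATES):.1f} MFLOPS")
```

The slowest kernels dominate the sum of reciprocals, so the harmonic mean sits near the low end of the range while the arithmetic mean is pulled up by the few fast kernels.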


In addition, two other sets of MFLOP rates are output in the same form as
above, one with a mean vector length (MVL) of 89 and one with a mean vector
length of 18. In each case, most of the spans or vector lengths were reduced.
In the case of the MVL of 89, the spans of 1001, for example, were dropped to
101, while in the case of the MVL of 18, spans of 1001 were dropped to 27. A
weight of 2 was attached to each kernel in the MVL = 89 case, while a weight
of 1 was attached to each kernel in the MVL = 18 case. In general, for each
particular kernel, the MFLOP rate decreased if the span decreased. However,
loops were repeated enough times to keep the order of magnitude of the
number of floating point operations about the same. In some cases the MFLOP
decrease was significant. For example, kernel 1 rates were


Span     MFLOPS

1001       152
 101       114
  27        64


The harmonic mean, however, did not vary much.

MVL     MFLOPS (Harmonic Mean)

468            10
 89             9
 18             6.5


MULTITASKING RESULTS


This section reports on the results of two experiments. Experiment 1
involved running two copies of the same program simultaneously, one on each
of the two processors of our X-MP/24. Experiment 2 involved actually
multitasking one of the benchmarks.


EXPERIMENT 1 

Bailey and Barton (ref. 1) suggested that it would be possible to get an
idea of the amount of interprocessor resource contention which would occur when
a particular program was multitasked by executing that program simultaneously
on each of the individual processors. We did this for two of the benchmarks,
the NAS Kernel Benchmark and the Vector Loop Program from Argonne. No explicit
multitasking was done. The procedure was simply to suspend all jobs in the
system and execute two identical, independent copies of the benchmark
simultaneously on the two processors. The results of these simultaneous runs
are given below, along with a single dedicated run for comparison.


NAS KERNEL BENCHMARK 



                                 Wall clock       CPU       MFLOPS

2 programs run simultaneously       35.34       34.6813     62.64
                                    35.36       34.6797     62.64

Single run                          33.84       33.4193     65.01


VECTOR LOOP PROGRAM


                                 Total wall
                                 clock time

2 programs run simultaneously       91.58
                                    91.52

Single run                          90.80


Both the NAS kernels and Vector Loop programs were modified slightly by
inserting calls to the function TIMEF (which gives wall clock time). This was
done so that an approximate maximum speedup could be calculated.

The Vector Loop Program was further modified so that instead of every tenth
array element being printed out, as is done in the original benchmark, every
array element is printed out (the parameter PRTINC was changed from 10 to 1).
This was done in order to see what effect a large amount of I/O would have on
interprocessor resource contention.

The speedup is calculated as suggested in the CRAY Multitasking Users
Guide:

$$ \text{speedup} = \frac{\text{execution time of uniprocessor run}}{\text{execution time of multiprocessor run}} $$

where the uniprocessor run refers to the run of the original unmodified program
executing on one CPU in dedicated time, and the multiprocessor run represents
the run of the multitasked program executing on two CPUs in dedicated time.

In our example the execution time of the multiprocessor run is approximated
by taking the average of the wall clock times of the two simultaneous runs.

The execution time of the uniprocessor run is approximated by doubling the
wall clock time of the unmodified program run. The results are shown below:

NAS KERNEL BENCHMARK

$$ \text{Speedup} = \frac{33.84 \times 2}{35.35} = 1.914 $$

VECTOR LOOP PROGRAM

$$ \text{Speedup} = \frac{90.80 \times 2}{91.55} = 1.983 $$

As Bailey pointed out, this is a relatively easy way of making some estimates
about multitasking. It is not true multitasking.
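The estimate can be reproduced from the wall clock times in the Experiment 1 tables. The sketch below follows the approximation described above: the uniprocessor time is the single-run time multiplied by the number of simultaneous copies, and the multiprocessor time is the average of the simultaneous runs. The function name is hypothetical.

```python
def estimated_speedup(single_wall, simultaneous_walls):
    """Approximate maximum speedup from independent simultaneous copies.

    single_wall        -- wall clock time of one dedicated run
    simultaneous_walls -- wall clock times of the copies run together
    """
    copies = len(simultaneous_walls)
    uniprocessor = copies * single_wall
    multiprocessor = sum(simultaneous_walls) / copies
    return uniprocessor / multiprocessor


if __name__ == "__main__":
    nas = estimated_speedup(33.84, [35.34, 35.36])
    vector_loop = estimated_speedup(90.80, [91.58, 91.52])
    print(f"NAS Kernel Benchmark: {nas:.3f}")
    print(f"Vector Loop Program:  {vector_loop:.3f}")
```

Both values fall short of 2 only by the slowdown each copy sees when the other processor is busy, which is the contention the experiment was meant to expose.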


EXPERIMENT 2 

In this effort, we decided to gain some experience with multitasking. To
save the time of developing an algorithm and program ourselves, we looked for
a benchmark program from the collection that could be quickly and easily
multitasked. The Vector Loop Program was chosen for a number of reasons.
First, it has the longest execution time of any of the benchmarks
(approximately 90 seconds). Second, the structure of the program is amenable
to multitasking. It is structured as one major loop which consists of 17
well-defined, independent minor loops. One execution of the major loop causes
all 17 of the minor loops to be executed once. Because these minor loops are
independent, any combination of them could be executed concurrently.

The first step in multitasking was to acquire more information about the
execution times of the various parts of the program and to determine the use
and scope of the data. Two utilities are available to give more information
about the execution time of various parts of the program. Flowtrace summarizes
the number of calls to subroutines and what portion of the program's time is
spent in those subroutines. SPY samples the program while it is executing and
reports on the number of times it found the program working in certain label
groupings as it samples. It can be used to identify frequently executed
portions of the program. Neither of these utilities provided the necessary
information for this particular program, and therefore the TIMEF function was
inserted in the code at appropriate places in order to determine the time used
by each of the 17 individual loops.

Another utility, FTREF, was found to be extremely useful in analyzing the
use and scope of the data. Using this information, the program was split by
placing some of the minor loops in a subroutine and having the main program
call this subroutine as a task with a call to TSKSTART. The strategy for
determining the split of the two parts was twofold. First and foremost, we
wanted the time spent in the subroutine to be approximately equal to the time
spent in the main program (between the calls to TSKSTART and TSKWAIT). If
possible, we wanted the subroutine to finish slightly ahead of the main
program because the overhead for executing a TSKWAIT is considerably less if
the program making the call to TSKWAIT (in our case the main program) doesn't
have to wait. The second thing that was considered when splitting the program
was memory contention. Even though the 17 loops were independent, some of them
were still accessing the same memory (shared, read-only memory). We did not
want the two parts trying to read the same piece of data at the same time.
Keeping this information in mind as well as the timing requirement, several
different versions were created and tested. The results for the most
successful version are given below. The time for a dedicated run of a single
execution of the original version of the Vector Loop Program is given for
comparison.



                     Total wall      Speedup of
                     clock time      multitasked version

Single run             89.57
Multitasked run        48.05              1.86
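The timing part of the splitting strategy described above (two parts of nearly equal duration, with the subroutine finishing slightly ahead of the main program) can be sketched as a simple greedy partition. This is not the procedure actually used in the experiment: the per-loop times below are hypothetical, and the sketch ignores memory contention and task overhead.

```python
# Hypothetical wall clock times (seconds) for minor loops 1 through 17.
LOOP_TIMES = [9.0, 3.5, 4.0, 8.0, 2.5, 3.0, 7.5, 6.0, 4.5,
              5.0, 2.0, 6.5, 7.0, 5.5, 8.5, 4.0, 3.0]


def split_loops(loop_times):
    """Greedy split of loops into a main-program part and a task part.

    Loops are taken longest first and each is given to the part with the
    smaller running total. The lighter part is then designated the task
    (subroutine) so that it finishes no later than the main program.
    Returns (main_loops, task_loops, main_seconds, task_seconds).
    """
    main, task = [], []
    main_t = task_t = 0.0
    order = sorted(enumerate(loop_times, start=1),
                   key=lambda pair: pair[1], reverse=True)
    for loop_number, seconds in order:
        if task_t < main_t:
            task.append(loop_number)
            task_t += seconds
        else:
            main.append(loop_number)
            main_t += seconds
    if task_t > main_t:
        main, task = task, main
        main_t, task_t = task_t, main_t
    return sorted(main), sorted(task), main_t, task_t


if __name__ == "__main__":
    main, task, main_t, task_t = split_loops(LOOP_TIMES)
    print("main program loops:", main, f"({main_t:.1f} s)")
    print("task loops:        ", task, f"({task_t:.1f} s)")
    print(f"ideal speedup: {sum(LOOP_TIMES) / max(main_t, task_t):.2f}")
```

The ideal speedup printed here is an upper bound for the assumed times; a measured figure such as the 1.86 above also reflects task start-up, synchronization, and contention for shared memory.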


It must be kept in mind that this was a rather simplistic exercise in
multitasking. It was a relatively small piece of code which had few
dependencies and which required very little synchronization. It did, however,
allow us to gain some experience in, and appreciation for, the intricacies of
multitasking.


FINAL REMARKS 

It is difficult to draw conclusions from a comparison of these benchmarks.
Part of the reason is the inherent difficulty of performance testing itself,
as explained by Worlton (ref. 3) and reinforced by Bailey (ref. 1). That is,
sometimes runs are made with tuned versions of benchmarks involving minor or
major changes which exploit the best features of a compiler or architecture.
Sometimes compiler versions are not noted, or differences in operating system
conditions are not disclosed.

We ran all of these benchmark programs as received. No changes were made.
An interesting and important point to make is the following. All five of them
are different, but parts of some of them do similar things. Suppose we observe
a single figure (rate) from each of the benchmarks and compare them, e.g.,


NAS Kernels Total MFLOPS            65
LINPACK System Solver               22
Sandia SPEED Total MFLOPS           13
Whetstone MWIPS                     25
Livermore Loops Harmonic Mean       10


The Sandia SPEED program gives the lowest performance figure while the
NAS kernels program gives the highest figure. However, this comparison is not
even meaningful because, as explained in the Observations section, the meaning
of MFLOPS differs among the benchmarks, since each one counts floating point
operations in a different way. Furthermore, one can increase these figures by
modifying (or tuning) the programs in various ways.







These benchmarks might be used in a relative sense to see how rates are
affected when changes are made to the compiler or operating system. Also, one
could choose a particular routine or a particular segment of code from this
benchmark collection and study its timing information. However, it's clear
that before making comparisons or trying to draw conclusions, one should
understand how rates (e.g., MFLOPS) are calculated, exactly what kind of
algorithms or calculations the code is doing, and whether or not the algorithms
and code are written to exploit the features of a compiler or architecture.